
Keyword Search Result

[Keyword] reinforcement learning (72 hits)

Results 61-72 of 72

  • On the Effects of Domain Size and Complexity in Empirical Distribution of Reinforcement Learning

    Kazunori IWATA  Kazushi IKEDA  Hideaki SAKAI  

     
    PAPER-Artificial Intelligence and Cognitive Science

      Vol: E88-D No:1  Page(s): 135-142

    We regard the events of a Markov decision process as the outputs of a Markov information source in order to analyze the randomness of an empirical sequence via the codeword length of the sequence. Randomness is an important viewpoint in reinforcement learning, since learning amounts to eliminating randomness in order to find an optimal policy, and the occurrence of an optimal empirical sequence also depends on it. We therefore introduce Lempel-Ziv coding to measure the randomness, which is characterized by the domain size and the stochastic complexity. Our experimental results confirm that both learning and the occurrence of an optimal empirical sequence depend on the randomness, and show that in the early stages the randomness is mainly characterized by the domain size, whereas as the number of time steps increases it depends increasingly on the complexity of the Markov decision process.
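
    As a rough illustration of how Lempel-Ziv coding can gauge the randomness of an empirical state-action sequence, here is a minimal Python sketch of LZ78 incremental parsing; it is not the authors' code, and the codeword-length bound it uses is a standard textbook approximation rather than the paper's exact measure.

        import math

        def lz78_codeword_length(sequence, alphabet_size):
            """Parse the sequence into distinct LZ78 phrases and return an
            approximate codeword length in bits (pointer + new symbol per phrase);
            more random sequences parse into more phrases and compress less."""
            phrases = {(): 0}        # the empty phrase
            current = ()
            for symbol in sequence:
                candidate = current + (symbol,)
                if candidate in phrases:
                    current = candidate      # keep extending the current phrase
                else:
                    phrases[candidate] = len(phrases)
                    current = ()             # start a new phrase
            num_phrases = len(phrases) - 1
            return num_phrases * (math.log2(max(num_phrases, 2)) + math.log2(alphabet_size))

        # A near-deterministic sequence yields a short codeword (low randomness).
        print(lz78_codeword_length([0, 1] * 50, alphabet_size=2))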

  • An Approach to the Piano Mover's Problem Using Hierarchic Reinforcement Learning

    Yuko ISHIWAKA  Tomohiro YOSHIDA  Hiroshi YOKOI  Yukinori KAKAZU  

     
    PAPER-Distributed Cooperation and Agents

      Vol: E87-D No:8  Page(s): 2106-2113

    We attempt to achieve cooperative behavior among autonomous decentralized agents constructed via Q-learning, a type of reinforcement learning. To this end, we examine the piano mover's problem, which includes a find-path problem. We propose a multi-agent architecture with an external agent and internal agents. The internal agents are homogeneous and can communicate with each other, and the movement of the external agent depends on the composition of the internal agents' actions. By learning how to move through the internal agents, the object is expected to avoid obstacles. We simulate the proposed method in a two-dimensional continuous world, and the results reveal the effectiveness of the proposed method.
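
    A minimal sketch of the composition idea, under our own assumption (not stated in the paper) that the external agent's displacement is simply the vector sum of the discrete moves chosen by the internal Q-learning agents:

        MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0), "stay": (0, 0)}

        def compose_external_move(internal_actions):
            """The external agent (the moved object) shifts by the sum of the
            internal agents' chosen moves; hypothetical composition rule."""
            dx = sum(MOVES[a][0] for a in internal_actions)
            dy = sum(MOVES[a][1] for a in internal_actions)
            return dx, dy

        # Two agents pulling right and one pulling up move the object to (2, 1).
        print(compose_external_move(["right", "right", "up"]))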

  • Multiagent Cooperating Learning Methods by Indirect Media Communication

    Ruoying SUN  Shoji TATSUMI  Gang ZHAO  

     
    PAPER-Neural Networks and Bioengineering

      Vol: E86-A No:11  Page(s): 2868-2878

    Reinforcement Learning (RL) is an efficient learning method for solving problems in which learning agents have no a priori knowledge of the environment. Ant Colony System (ACS) provides an indirect communication method among cooperating agents and is an efficient method for solving combinatorial optimization problems. Based on the indirect communication of ACS and the update policy for reinforcement values in RL, this paper proposes the Q-ACS multiagent cooperative learning method, which can be applied both to Markov Decision Processes (MDPs) and to combinatorial optimization problems. The advantage of Q-ACS is that the learning agents share episodes, which benefits the exploitation of accumulated knowledge and lets the learned reinforcement values be used efficiently. Further, taking visit counts into account, this paper proposes the T-ACS multiagent learning method, whose merit is that the learning agents share better policies, which benefits exploration during the agents' learning processes. Regarding Q-ACS and T-ACS as homogeneous multiagent learning methods, and in light of indirect media communication among heterogeneous agents, this paper also presents a heterogeneous multiagent RL method, D-ACS, which combines the learning policies of Q-ACS and T-ACS and adopts different update policies for the reinforcement values. The agents in our methods cooperate in a simple way, exchanging information in the form of reinforcement values updated in a model common to all agents. Having the advantages of actively exploring the unknown environment and effectively exploiting learned knowledge, the proposed methods can solve both MDP problems and combinatorial optimization problems effectively. The results of experiments on the hunter game and the traveling salesman problem demonstrate that our methods perform competitively with representative methods in each domain.
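
    The Python sketch below illustrates only the shared-table idea behind this kind of indirect communication: all agents read and write a single table of reinforcement values, so one agent's experience immediately informs the others. The actual Q-ACS, T-ACS, and D-ACS update rules are defined in the paper and are not reproduced here.

        from collections import defaultdict

        ALPHA, GAMMA = 0.1, 0.9
        shared_Q = defaultdict(float)   # common model updated by every agent
        visits = defaultdict(int)       # visit counts that T-ACS-style variants could use

        def shared_update(state, action, reward, next_state, actions):
            """Q-learning-style update applied to the table shared by all agents."""
            visits[(state, action)] += 1
            best_next = max(shared_Q[(next_state, a)] for a in actions)
            shared_Q[(state, action)] += ALPHA * (
                reward + GAMMA * best_next - shared_Q[(state, action)])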

  • Does Reinforcement Learning Simulate Threshold Public Goods Games?: A Comparison with Subject Experiments

    Atsushi IWASAKI  Shuichi IMURA  Sobei H. ODA  Itsuo HATONO  Kanji UEDA  

     
    PAPER

      Vol: E86-D No:8  Page(s): 1335-1343

    This paper examines the descriptive power and the limitations of a simple reinforcement learning model (REL) by comparing simulation results with the results of an economic experiment with human subjects. Agent-based computational economics and experimental economics are becoming increasingly popular tools for economists, and a variety of learning models have been proposed and examined in both fields using games with a unique equilibrium. However, little attention has been given to games with multiple equilibria. We examine threshold public goods games with two types of equilibria, in which each player in a five-person group simultaneously contributes to the public good from her private endowment. In the experiments, we observe two patterns of subject behavior: a cooperative and a non-cooperative pattern. Our simulation results show that the REL reproduces the cooperative pattern but not the non-cooperative one. However, the results suggest that the REL does reproduce the non-cooperative pattern in terms of the agents' internal states, which implies that deterministic strategies would be required to reproduce the non-cooperative pattern in these games. We show an example of the REL with deterministic strategies.
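
    For concreteness, here is a minimal sketch of one common "simple reinforcement learning" model from this literature, a Roth-Erev style propensity learner; the paper's exact REL specification and parameters may differ.

        import random

        class RELAgent:
            """Propensity-based learner: contribution levels are chosen with
            probability proportional to their accumulated propensities."""

            def __init__(self, contribution_levels, initial_propensity=1.0):
                self.propensities = {c: initial_propensity for c in contribution_levels}

            def choose(self):
                total = sum(self.propensities.values())
                r = random.uniform(0, total)
                for level, p in self.propensities.items():
                    r -= p
                    if r <= 0:
                        return level
                return level

            def reinforce(self, chosen_level, payoff):
                # the realized payoff reinforces the chosen contribution level
                self.propensities[chosen_level] += payoff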

  • An Intelligent Stock Trading System Based on Reinforcement Learning

    Jae Won LEE  Sung-Dong KIM  Jongwoo LEE  Jinseok CHAE  

     
    PAPER-Artificial Intelligence, Cognitive Science

      Vol: E86-D No:2  Page(s): 296-305

    This paper describes a stock trading system based on reinforcement learning that regards the process of stock price changes as a Markov decision process (MDP). The system adopts two popular reinforcement learning algorithms, temporal-difference (TD) learning and Q-learning, for selecting stocks and optimizing trading parameters, respectively. The input features of the system are devised using technical analysis, and the value functions are approximated by feedforward neural networks. Multiple cooperative agents are used for Q-learning to efficiently integrate global trend prediction with local trading strategy. Agents communicate with one another, sharing training episodes and learned policies, while keeping the overall scheme of conventional Q-learning. Experimental results on the Korean stock market show that our trading system outperforms the market average and makes appreciable profits. Furthermore, the system proves superior to a system trained by supervised learning from the viewpoint of risk management.
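
    A minimal sketch of the TD side of such a system, with a linear approximator standing in for the paper's feedforward networks and with purely illustrative feature names:

        import numpy as np

        ALPHA, GAMMA = 0.01, 0.95
        weights = np.zeros(3)   # e.g. [moving-average ratio, RSI, volume change] (hypothetical)

        def td0_update(features, reward, next_features):
            """Move V(s) = w . phi(s) toward the one-step TD target."""
            v, v_next = weights @ features, weights @ next_features
            td_error = reward + GAMMA * v_next - v
            weights[:] = weights + ALPHA * td_error * features
            return td_error

        # One illustrative update with made-up feature vectors.
        td0_update(np.array([1.02, 0.4, 0.1]), reward=0.5,
                   next_features=np.array([1.01, 0.5, 0.2]))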

  • Labeling Q-Learning in POMDP Environments

    Haeyeon LEE  Hiroyuki KAMAYA  Kenichi ABE  

     
    PAPER-Biocybernetics, Neurocomputing

      Vol: E85-D No:9  Page(s): 1425-1432

    This paper presents a new Reinforcement Learning (RL) method, called "Labeling Q-learning (LQ-learning)," for solving partially observable Markov Decision Process (POMDP) problems. Hierarchical RL methods have recently been widely studied, but they have the drawback that learning time and memory are consumed merely to maintain the hierarchical structure, even when it is not necessary. Our LQ-learning, in contrast, has no hierarchical structure and instead adopts a new type of internal memory mechanism: the agent perceives the current state as a pair of an observation and its label, and can thereby distinguish more exactly between states that look the same but are in fact different. That is, at each step t we define a new type of perception of the environment, õt=(ot,θt), where ot is the conventional observation and θt is the label attached to the observation ot. The classical RL algorithm is then used as if the pair (ot,θt) were a Markov state. The labeling is carried out by a Boolean variable, called "CHANGE," and a hash-like or mod function, called the Labeling Function (LF). To demonstrate the efficiency of LQ-learning, we apply it to "maze problems" in Grid-Worlds, which are used in much of the literature as simulated POMDP environments. Using LQ-learning, we can solve the maze problems without initial knowledge of the environments.
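
    The labeled-state idea can be sketched as ordinary Q-learning over (observation, label) pairs, with a mod-style labeling function advancing the label whenever CHANGE fires. In the sketch below, the condition that triggers CHANGE is left to the caller, and the number of labels is an illustrative assumption, not the paper's exact rule.

        from collections import defaultdict

        ALPHA, GAMMA, NUM_LABELS = 0.1, 0.9, 4
        Q = defaultdict(float)   # keyed by ((observation, label), action)

        def label_update(label, change):
            """Mod-style labeling function: advance the label when CHANGE is true."""
            return (label + 1) % NUM_LABELS if change else label

        def lq_step(obs, label, action, reward, next_obs, actions, change):
            """One Q-learning backup over the augmented state (observation, label)."""
            next_label = label_update(label, change)
            best_next = max(Q[((next_obs, next_label), a)] for a in actions)
            s = (obs, label)
            Q[(s, action)] += ALPHA * (reward + GAMMA * best_next - Q[(s, action)])
            return next_label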

  • Learning of Virtual Words Utilized in Negotiation Process between Agents

    Hiroyuki IIZUKA  Keiji SUZUKI  Masahito YAMAMOTO  Azuma OHUCHI  

     
    PAPER

      Vol: E83-A No:6  Page(s): 1075-1082

    Agent-based simulations are expected to enable analysis of complex social phenomena. In such simulations, one of the important behaviors of agents is negotiation, through which agents can engage in complex interactions with each other. The ability of agents to negotiate is therefore important in simulations of artificial societies. In this paper, we focus on price negotiations, in which the two sides have opposing interests. In the conventional price negotiation model, the process consists of an alternating succession of directly presented offers and counter-offers exchanging the desired prices. As an extended price negotiation model, we introduce virtual words that mimic the negotiation techniques humans use to indirectly convey a desired price: the process of the proposed model consists of an alternating succession of offers of a desired price and counter-offers of a word, where the words represent the degree of the agent's demand. We propose agents with reinforcement learning that can acquire the ability to distinguish words and use them in negotiation. As a result, we show that the virtual words become meaningful in the process of negotiation between agents whose negotiating strategies are acquired by reinforcement learning.

  • Controlling Multiple Cranes Using Multi-Agent Reinforcement Learning: Emerging Coordination among Competitive Agents

    Sachiyo ARAI  Kazuteru MIYAZAKI  Shigenobu KOBAYASHI  

     
    PAPER-Real Time Control

      Vol: E83-B No:5  Page(s): 1039-1047

    This paper describes Profit-Sharing, a reinforcement learning approach that can be used to design a coordination strategy in a multi-agent system, and demonstrates its effectiveness empirically in the coil yard of a steel manufacturer. This domain consists of multiple cranes that are operated asynchronously but need coordination to adjust their initial task-execution plans and avoid the collisions that would be caused by resource limitations. The problem is beyond classical hand-coded expert methods as well as mathematical analysis, because of scattered information, stochastically generated tasks, and the difficulty of completing tasks on schedule. In recent years, many applications of reinforcement learning algorithms based on Dynamic Programming (DP), such as Q-learning and the Temporal Difference method, have been introduced. They promise optimal performance of the agent in Markov decision processes (MDPs), but in non-MDPs, such as multi-agent domains, there is no guarantee that the agent's policy converges. Profit-Sharing, in contrast to the DP-based methods, can guarantee convergence to a rational policy, meaning that the agent reaches one of the desirable states, even in non-MDPs where agents learn concurrently and competitively. We therefore embedded Profit-Sharing into each crane operator to acquire cooperative rules in this dynamic domain, and we show its applicability to the real world by comparing it with a RAP (Reactive Action Planner) model encoded from expert knowledge.
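
    A minimal sketch of Profit-Sharing's episode-wide credit assignment, assuming a geometric decay toward earlier steps; the paper chooses the credit-assignment function specifically so that convergence to a rational policy is guaranteed, which this sketch does not address.

        from collections import defaultdict

        DECAY = 0.5
        weights = defaultdict(float)   # rule weights, keyed by (state, action)

        def profit_share(episode, reward):
            """episode: list of (state, action) pairs from the start to the rewarded step.
            Every rule fired during the episode is reinforced, with credit
            decaying geometrically the further it lies from the reward."""
            credit = reward
            for state, action in reversed(episode):
                weights[(state, action)] += credit
                credit *= DECAY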

  • A Constructive Compound Neural Networks. II Application to Artificial Life in a Competitive Environment

    Jianjun YAN  Naoyuki TOKUDA  Juichi MIYAMICHI  

     
    PAPER-Artificial Intelligence, Cognitive Science

      Vol: E83-D No:4  Page(s): 845-856

    We have developed a new, efficient neural network-based algorithm for Alife applications in a competitive world, whereby the effects of interactions among organisms are evaluated in a weak form by taking into account the position of the nearest food elements but not the positions of the other competing organisms. Two online learning algorithms that we developed, an instructive ASL (adaptive supervised learning) algorithm and an evaluative, feedback-oriented RL (reinforcement learning) algorithm, have been tested in simulated Alife environments with various neural network algorithms. The constructive compound neural network algorithm FuzGa guided by the ASL learning algorithm proved to be the most efficient among the methods tested, including the classical constructive cascaded CasCor algorithm of [18],[19] and fixed, non-constructive fuzzy neural networks. Adopting an adaptively selected best sequence of the feedback action period Δα, which we found to be a decisive parameter in improving network efficiency, the ASL-guided FuzGa achieved an average fitness value of 541.8 (standard deviation 48.8), compared with 500 (53.8) for ASL-guided CasCor and 489.2 (39.7) for RL-guided FuzGa. Our FuzGa algorithm also outperformed CasCor in time complexity by 31.1%. We have elucidated how the dimensionless food-availability parameter FA, which represents the intensity of interactions among the organisms, relates to the best sequence of the feedback action period Δα and to the optimal number of hidden neurons for the given network configuration. We confirm that the present solution successfully evaluates the effect of interactions at larger FA, reducing to an isolated solution at lower values of FA. The simulation is implemented with Java threads, ensuring the randomness of individual activities.

  • Strategy Acquisition for the Game "Othello" Based on Reinforcement Learning

    Taku YOSHIOKA  Shin ISHII  Minoru ITO  

     
    PAPER-Bio-Cybernetics and Neurocomputing

      Vol: E82-D No:12  Page(s): 1618-1626

    This article discusses automatic strategy acquisition for the game "Othello" based on a reinforcement learning scheme. In our approach, a computer player that initially knows only the game rules becomes stronger after playing several thousand games against another player. In each game, the computer player refines its evaluation function for the game state by min-max reinforcement learning (MMRL), a simple learning scheme that uses the min-max strategy. Since the state space of Othello is huge, we employ a normalized Gaussian network (NGnet) to represent the evaluation function. As a result, the computer player becomes strong enough to beat a player employing a heuristic strategy. This article experimentally shows that MMRL is better than TD(0), and that the NGnet is better than a multi-layered perceptron, on our Othello task.
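
    A minimal sketch of a min-max flavored update, with a plain lookup table standing in for the paper's normalized Gaussian network; the backup target below (a one-ply min-max over successor evaluations) is our assumption about the general shape of MMRL, not the paper's exact rule.

        ALPHA = 0.1
        V = {}   # evaluation function over board states (an NGnet in the paper)

        def mmrl_update(state, successor_values, our_turn):
            """Move V(state) toward the min-max backup of its successors:
            max over our own moves, min over the opponent's moves."""
            backup = max(successor_values) if our_turn else min(successor_values)
            v = V.get(state, 0.0)
            V[state] = v + ALPHA * (backup - v)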

  • Learning the Balance between Exploration and Exploitation via Reward

    Tetsuya YOSHIDA  Koichi HORI  Shinichi NAKASUKA  

     
    PAPER

      Vol: E82-A No:11  Page(s): 2538-2545

    This paper proposes a new method to improve cooperation in concurrent systems within the framework of Multi-Agent Systems (MAS) by utilizing reinforcement learning. When subsystems work independently and concurrently, achieving appropriate cooperation among them is important for the effectiveness of the overall system. Treating subsystems as agents makes it easy to deal explicitly with the interactions among them, since these can be modeled naturally as communication of intended information among agents. In our approach, agents try to learn the appropriate balance between exploration and exploitation via reward, which is important in distributed and concurrent problem solving in general. By focusing on how reward is given in reinforcement learning, rather than on the learning equation itself, two kinds of reward are defined in the context of cooperation between agents, in contrast to reinforcement learning in a single-agent framework: reward for insistence by an individual agent facilitates exploration, and reward for concession to other agents facilitates exploitation. Our cooperation method was examined through experiments on the design of micro-satellites, and the results showed that letting the agents themselves learn the appropriate balance between insistence and concession was effective, to some extent, in facilitating cooperation among agents. The results also suggest the possibility of using the relative magnitude of these rewards as a new control parameter for the overall behavior of a MAS.

  • RTP-Q: A Reinforcement Learning System with Time Constraints Exploration Planning for Accelerating the Learning Rate

    Gang ZHAO  Shoji TATSUMI  Ruoying SUN  

     
    PAPER-Artificial Intelligence and Knowledge

      Vol: E82-A No:10  Page(s): 2266-2273

    Reinforcement learning is an efficient method for solving Markov Decision Processes in which an agent improves its performance by using scalar reward values, giving it a high capability for reactive and adaptive behavior. Q-learning is a representative reinforcement learning method that is guaranteed to obtain an optimal policy but needs numerous trials to achieve it. The k-Certainty Exploration Learning System realizes active exploration of an environment, but its learning process is separated into two phases, and estimated values are not derived while the environment is being identified. The Dyna-Q architecture makes fuller use of a limited amount of experience and achieves a better policy with fewer environment interactions by learning and planning under time constraints while identifying the environment; however, its exploration is not active. This paper proposes the RTP-Q reinforcement learning system, which turns an efficient method for exploring an environment into time-constrained exploration planning and combines it with an integrated system of learning, planning, and reacting, aiming for the best of both approaches. By improving exploration of the environment and refining the model of the environment, the RTP-Q learning system accelerates the learning rate for obtaining an optimal policy. The results of experiments on navigation tasks demonstrate that the RTP-Q learning system is efficient.
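
    A minimal Dyna-style sketch of the integrated learn-plan-react loop described above; the fixed number of planning backups per real step stands in for the paper's time constraint, and RTP-Q's exploration-planning heuristics are omitted.

        import random
        from collections import defaultdict

        ALPHA, GAMMA, PLANNING_STEPS = 0.1, 0.95, 10
        Q = defaultdict(float)
        model = {}   # (state, action) -> (reward, next_state), learned online

        def backup(s, a, r, s2, actions):
            best = max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += ALPHA * (r + GAMMA * best - Q[(s, a)])

        def rtpq_step(s, a, r, s2, actions):
            backup(s, a, r, s2, actions)      # learning from the real transition
            model[(s, a)] = (r, s2)           # refining the model of the environment
            for _ in range(PLANNING_STEPS):   # time-bounded planning on simulated experience
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                backup(ps, pa, pr, ps2, actions)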
